SHOWCASING WORK IN PROGRESS
COMPLETED, BUT UNPOLISHED, WORK AVAILABLE HERE
Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.
Factor names containing special characters, like -, can cause issues downstream, so cleaning them may prove helpful.
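A minimal sketch of one way to do this, assuming the data is loaded into a pandas DataFrame (the column names here are hypothetical stand-ins for the dataset's hyphenated features):

```python
import pandas as pd

# Hypothetical frame with the dataset's hyphenated column style
data = pd.DataFrame({"education-num": [9, 13], "capital-gain": [0, 2174]})

# Replace the problematic character with an underscore
data.columns = data.columns.str.replace("-", "_", regex=False)

print(list(data.columns))  # ['education_num', 'capital_gain']
```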
Working with categorical variables often involves transforming strings to some other value: frequently 0 or 1 for binomial factors, and a mapping $\{x_{0}, x_{1}, \ldots, x_{n}\} \rightarrow \{0, 1, \ldots, n\}$ for multinomial factors.
These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best), or nominal (i.e. values indicate a state, e.g. blue, green, yellow).
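Such a mapping can be sketched with pandas; the factors and their values below are hypothetical examples of the binomial and ordinal cases:

```python
import pandas as pd

# Hypothetical binomial and ordinal factors
df = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K"],   # binomial -> {0, 1}
    "rating": ["worst", "better", "best"],  # ordinal  -> rank-preserving map
})

df["income"] = df["income"].map({"<=50K": 0, ">50K": 1})
df["rating"] = df["rating"].map({"worst": 0, "better": 1, "best": 2})

print(df["income"].tolist(), df["rating"].tolist())  # [0, 1, 0] [0, 1, 2]
```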
For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data, the training features, or independent variables ($X$).
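A sketch of this separation, assuming a hypothetical cleaned DataFrame with the label in an income column:

```python
import pandas as pd

# Hypothetical cleaned data with the label in the 'income' column
data = pd.DataFrame({
    "age": [39, 50, 38],
    "education_num": [13, 13, 9],
    "income": [0, 1, 0],
})

income = data["income"]                 # Y, the dependent variable
features = data.drop("income", axis=1)  # X, the independent variables

print(list(features.columns))  # ['age', 'education_num']
```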
The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).
To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.
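A sketch of the transformation with numpy; the sample values here are hypothetical. Note that since capital_gain and capital_loss contain zeros, where $\ln(0)$ is undefined, the shifted variant $\ln(1 + x)$ is commonly used in practice:

```python
import numpy as np
import pandas as pd

# Hypothetical capital_gain values; zeros are common in the real feature
capital_gain = pd.Series([0, 0, 2174, 14084, 99999], dtype=float)

# ln(x) is undefined at 0, so the shifted transform ln(1 + x) is applied
log_capital_gain = np.log1p(capital_gain)

# Skewness shrinks in magnitude after the transform
print(round(capital_gain.skew(), 3), round(log_capital_gain.skew(), 3))
```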
Why does this matter: The extreme points may affect the performance of the predictive model.
Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.
Why DOESN'T this matter: The distribution of the independent variables is not an assumption of most models. Linear regression, for example, assumes a zero conditional mean of the residuals, $E\left(u | x\right) = 0$ where $u = Y - \hat{Y}$, and homoskedasticity of the residuals given the independent variables, rather than any particular distribution of the regressors. In this analysis, the dependent variable is categorical (i.e. discrete or non-continuous) and linear regression is not an appropriate model anyway.
Again, to reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.
Originally, the influence of capital_loss on income was statistically significant, but after the logarithmic transformation, it is not.
Here it can be seen that with a change to the skew, the confidence interval now passes through zero whereas before it did not.
This passing through zero is interpreted as the independent variable being statistically indistinguishable from zero influence on the dependent variable.
| Feature | Skewness | Mean | Variance |
|---|---|---|---|
| Capital Loss | 4.516154 | 88.595418 | 163985.81018 |
| Capital Gain | 11.788611 | 1101.430344 | 56345246.60482 |
| Log Capital Loss | 4.271053 | 0.355489 | 2.54688 |
| Log Capital Gain | 3.082284 | 0.740759 | 6.08362 |
These two terms, normalization and standardization, are frequently used interchangeably, but serve two different scaling purposes.
Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).
Another method of influencing the model's predictive power is normalization of the numerical independent variables, after which each feature will be treated equally in the model.
However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.
Note the output from scaling. age is no longer 39 but is instead 0.30137. This value is meaningful only in context of the rest of the data and not on its own.
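The 0.30137 can be reproduced in a minimal sketch with sklearn's MinMaxScaler, assuming age spans 17 to 90 as in the census data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages; in the census data age spans 17 to 90
ages = pd.DataFrame({"age": [39, 17, 90]})

# Rescale each feature to [0, 1]: (x - min) / (max - min)
scaler = MinMaxScaler()
ages["age"] = scaler.fit_transform(ages[["age"]])

print(ages["age"].round(5).tolist())  # [0.30137, 0.0, 1.0]
```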
Earlier, I transformed some categorical values into a numeric mapping. Another, perhaps more common, way to do this is to make dummy variables from the values of those factors. Pandas has a simple function, pd.get_dummies(), that can perform this very quickly.
Note that this will create a new variable for every value a categorical variable takes:
| | someFeature | | someFeature_A | someFeature_B | someFeature_C |
|---|---|---|---|---|---|
| 0 | B | | 0 | 1 | 0 |
| 1 | C | ----> one-hot encode ----> | 0 | 0 | 1 |
| 2 | A | | 1 | 0 | 0 |
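The transformation above can be sketched directly (dtype=int keeps the 0/1 output shown, since recent pandas versions return booleans by default):

```python
import pandas as pd

df = pd.DataFrame({"someFeature": ["B", "C", "A"]})

# One new 0/1 column per value the categorical variable takes
dummies = pd.get_dummies(df["someFeature"], prefix="someFeature", dtype=int)

print(list(dummies.columns))    # ['someFeature_A', 'someFeature_B', 'someFeature_C']
print(dummies.values.tolist())  # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```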
This means that $p$, the number of factors, will grow, and potentially by a large amount.
It is also worth noting that for modeling, it is important that one value of the factor, a "base case", be dropped from the data. This is because the base case is redundant, i.e. it can be inferred perfectly from the other cases, and, more specifically and more detrimentally to our model, it leads to multicollinearity of the terms.
In some models (e.g. logistic regression, linear regression), an assumption of no multicollinearity must hold.
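pd.get_dummies can drop the base case directly via its drop_first option; a sketch with a hypothetical color factor:

```python
import pandas as pd

colors = pd.DataFrame({"color": ["blue", "green", "yellow", "blue"]})

# drop_first=True drops the alphabetically-first value ("blue") as the base
# case; its rows are exactly those where every remaining dummy is 0
dummies = pd.get_dummies(colors["color"], prefix="color", drop_first=True, dtype=int)

print(list(dummies.columns))     # ['color_green', 'color_yellow']
print(dummies.iloc[0].tolist())  # [0, 0]  -> the base case, "blue"
```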
After transforming with one-hot-encoding, all categorical variables have been converted into numerical features. Earlier, they were normalized (i.e. scaled between 0 and 1).
Next, for training a machine learning model, it is necessary to split the data into segments. One segment will be used for training the model, the training set, and the other will be used for testing the model, the testing set.
A common method of splitting is to segment based on proportion of data. A general 80:20 rule is typical for training:test.
sklearn has a function that works well for this, sklearn.model_selection.train_test_split. Essentially, it randomly selects a portion of the data to segment into a training and a testing set.
random_state: By setting a seed, option random_state, we can ensure the random splitting is the same for our model. This is necessary for evaluating the effectiveness of the model. Otherwise, we would be training and testing a model with the same proportional split (if we kept that static), but with different observations of the data.
test_size: This setting represents the proportion of the data to be used for testing. Generally, this is the complement (1 - x = c) of the training_size. For example, if test_size is 0.2, the training_size is 0.8.
stratify: Preserves the proportion of the label's classes in the split data. As an example, let 1 and 0 indicate the positive and negative cases of a label, respectively. It's possible that only positive or only negative classes exist in either the training or testing set (e.g. $\forall y \in Y_{train}, y = 1$). To avoid this worst-case scenario, stratify will preserve the ratio of positive to negative classes in both the training and testing sets.
Here the data is split 80:20 with a seed set of 0 and the distribution of the label's classes preserved:
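A minimal sketch of such a split, using a hypothetical toy dataset with balanced classes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)              # 10 hypothetical observations
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # balanced label classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80:20 training:test split
    random_state=0,   # seed, so the split is reproducible
    stratify=y,       # preserve the class ratio in both sets
)

print(len(X_train), len(X_test))  # 8 2
print(sorted(y_test))             # [0, 1] -- one of each class
```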
In terms of income as a predictor for donating, CharityML has stated they will most likely receive a donation from individuals whose income is in excess of 50,000/yr.
CharityML has limited funds to reach out to potential donors. Misclassifying a person as making more than 50,000/yr is COSTLY for CharityML. It's more important that the model accurately predicts a person making more than 50,000/yr (i.e. true-positive) than accidentally predicting they do when they don't (i.e. false-positive).
Accuracy is the ratio of correctly predicted data points to the total number of data points:
$$Accuracy=\frac{\sum Correctly\ Classified\ Points}{\sum All\ Points}=\frac{\sum True\ Positives + \sum True\ Negatives}{\sum Observations}$$A Confusion Matrix demonstrates what a true/false positive/negative is:
| | Predict 1 | Predict 0 |
|---|---|---|
| True 1 | True Positive | False Negative |
| True 0 | False Positive | True Negative |
The errors of these are sometimes referred to as type errors:
| | Predict 1 | Predict 0 |
|---|---|---|
| True 1 | True Positive | Type 2 Error |
| True 0 | Type 1 Error | True Negative |
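sklearn can compute this matrix directly; a sketch with hypothetical predictions, ordering the labels as [1, 0] so the layout matches the tables above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 1, 0]  # hypothetical predictions

# labels=[1, 0] puts the positive class first:
# rows are the actual class, columns the predicted class
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm.tolist())  # [[2, 1], [1, 2]] -> [[TP, FN], [FP, TN]]
```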
For this analysis, we want to avoid false positives or type 1 errors. Put differently, we prefer false negatives to false positives.
A model that meets that criteria, $False\ Negative \succ False\ Positive$, is known as preferring precision over recall, or is a high precision model.
Humorously and perhaps more understandably, these type errors can be demonstrated as such:

Precision is a measure of the number of correctly predicted positive classes to the total number of positive class predictions (correct as well as incorrect predictions of the positive class):
$$Precision = \frac{\sum True\ Positives}{\sum True\ Positives + \sum False\ Positives}$$A model which avoids false positives would have a high precision value, or score. It may also be skewed toward false negatives.
Recall, sometimes referred to as a model's sensitivity, is a measure of the correctly predicted positive classes to the actual number of positive classes (true positives and false negatives are both actually positive classes):
$$Recall = \frac{\sum True\ Positives}{\sum Actual\ Positives} = \frac{\sum True\ Positives}{\sum True\ Positives + \sum False\ Negatives}$$A model which avoids false negatives would have a high recall value, or score. It may also be skewed toward false positives.
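A worked sketch of both formulas with hypothetical counts, illustrating a model skewed toward false negatives (precision higher than recall):

```python
# Hypothetical counts from a confusion matrix, for a model that is
# skewed toward false negatives rather than false positives
tp, fp, fn, tn = 8, 2, 4, 6

precision = tp / (tp + fp)  # 8 / 10 = 0.8
recall = tp / (tp + fn)     # 8 / 12 ~= 0.667

print(precision, round(recall, 3))  # 0.8 0.667
```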
An F-$\beta$ Score is a method of scoring a model both on precision and recall.
Where $\beta \in [0,\infty)$:
$$F_{\beta} = \left(1+\beta^{2}\right) \cdot \frac{Precision\ \cdot Recall}{\beta^{2} \cdot Precision + Recall}$$When $\beta = 0$, we get precision: $$F_{\beta=0} = \left(1+0^{2}\right) \cdot \frac{Precision\ \cdot Recall}{0^{2} \cdot Precision + Recall} = \left(1\right) \cdot \frac{Precision\ \cdot Recall}{Recall} = Precision$$
When $\beta = 1$, we get a harmonized mean of precision and recall:
$$F_{\beta=1} = \left(1+1^{2}\right) \cdot \frac{Precision\ \cdot Recall}{1^{2} \cdot Precision + Recall} = \left(2\right) \cdot \frac{Precision\ \cdot Recall}{Precision + Recall}$$... and when $\beta > 1$, we get something closer to recall:
$$F_{\beta \rightarrow \infty} = \left(1+\beta^{2}\right) \cdot \frac{Precision\ \cdot Recall}{\beta^{2} \cdot Precision + Recall} = \frac{Precision\ \cdot Recall}{\frac{\beta^{2}}{1+\beta^{2}} \cdot Precision + \frac{1}{1+ \beta^{2}} \cdot Recall}$$As $\beta \rightarrow \infty$: $$\frac{Precision\ \cdot Recall}{\frac{\beta^{2}}{1+\beta^{2}} \cdot Precision + \frac{1}{1+ \beta^{2}} \cdot Recall} \rightarrow \frac{Precision \cdot Recall}{1 \cdot Precision + 0 \cdot Recall} = \frac{Precision}{Precision} \cdot Recall = Recall$$
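sklearn's fbeta_score exposes this directly; a sketch with hypothetical predictions where precision (2/3) exceeds recall (1/2), so a smaller $\beta$ yields a higher score:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical labels
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # precision = 2/3, recall = 1/2

print(round(precision_score(y_true, y_pred), 3))  # 0.667
print(round(recall_score(y_true, y_pred), 3))     # 0.5
for beta in (0.5, 1.0, 2.0):
    # beta < 1 weighs precision more heavily, beta > 1 weighs recall more
    print(beta, round(fbeta_score(y_true, y_pred, beta=beta), 3))
# 0.5 0.625  <- closest to precision
# 1.0 0.571  <- harmonic mean
# 2.0 0.526  <- closest to recall
```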
The Naive Bayes Classifier will be used as a benchmark model for this work.
Bayes' Theorem is as such:
$$P\left(A|B\right) = \frac{P\left(B|A\right) \cdot P\left(A\right)}{P\left(B\right)}$$It is considered naive as it assumes each feature is independent of one another.
Bayes' Theorem calculates the probability of an outcome (e.g. whether an individual receives income exceeding 50k/yr), based on the joint probabilistic distributions of certain other events (e.g. any factors we include in the model).
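A worked sketch of the theorem with hypothetical (made-up) probabilities:

```python
# Bayes' Theorem with hypothetical probabilities:
# A = income > 50k/yr, B = holds a college degree
p_a = 0.25          # P(A): prior probability of income > 50k
p_b_given_a = 0.60  # P(B|A): probability of a degree, given income > 50k
p_b = 0.30          # P(B): overall probability of a degree

p_a_given_b = p_b_given_a * p_a / p_b  # P(A|B)
print(p_a_given_b)  # 0.5
```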
As an example, I propose a model that always predicts an individual makes more than 50k/yr. This model has no false negatives; it has perfect recall (recall = 1).
Note: The purpose of generating a naive predictor is simply to show what a base model without any intelligence would look like. When there is no benchmark model set, getting a result better than random choice is a similar starting point.
Since this model always predicts a 1:
- True Positives: $\sum$ (predict 1 when 1 is true), equal to the sum of the label
- False Positives: $\sum$ (predict 1 when 0 is true)
- True Negatives: $\sum$ (predict 0 when 0 is true) $= 0$, as no 0s are ever predicted
- False Negatives: $\sum$ (predict 0 when 1 is true) $= 0$, as no 0s are ever predicted

Note: I set $\beta = \frac{1}{2}$ as I want to penalize false positives, which are costly for CharityML. Recall the implications of setting the values of $\beta$ from before.
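These scores can be sketched for the always-predict-1 model, using a hypothetical label vector (the 25% positive rate loosely mirrors the census data):

```python
import numpy as np

# Hypothetical label vector; roughly a quarter of individuals earn > 50k
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])

# The naive model predicts 1 for everyone
tp = int(np.sum(y))       # every actual 1 is a true positive
fp = int(np.sum(y == 0))  # every actual 0 is a false positive
fn = 0                    # no 0s predicted, so no false negatives

accuracy = tp / len(y)      # 0.25
precision = tp / (tp + fp)  # equals accuracy: everything is predicted 1
recall = tp / (tp + fn)     # 1.0: perfect recall

beta = 0.5
fscore = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(accuracy, recall, round(fscore, 3))  # 0.25 1.0 0.294
```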